85 research outputs found

    Feature-based method for document alignment in comparable news corpora

    In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It improves performance by 4.1% and 8% over Pearson’s correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of 23.2% and 15.3% on the two sets of bilingual corpora compared with a prior information retrieval-based method.
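The abstract does not say how the Discrete Fourier Transform-based term frequency distribution feature is computed. As a rough illustration only, the sketch below assumes one plausible reading: a term's per-day frequency counts in each corpus are treated as a time series, each series is mapped to its DFT magnitude spectrum, and the two spectra are compared with cosine similarity. The function names (`dft_magnitudes`, `cosine`, `spectral_similarity`) and the choice of cosine similarity are assumptions for this example, not the authors' method.

```python
import cmath
import math


def dft_magnitudes(series):
    """Return the magnitude of each DFT coefficient of a real-valued series."""
    n = len(series)
    magnitudes = []
    for k in range(n):
        # Direct O(n^2) DFT; adequate for short daily-count series.
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(series)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def spectral_similarity(freq_a, freq_b):
    """Compare two term-frequency time series by their DFT magnitude spectra.

    Assumption: both series cover the same number of days.
    """
    return cosine(dft_magnitudes(freq_a), dft_magnitudes(freq_b))


# Hypothetical daily counts of a term and of its translation in the other corpus.
english_counts = [0, 3, 0, 0, 1, 0]
chinese_counts = [0, 0, 3, 0, 0, 1]  # same pattern, reported one day later
score = spectral_similarity(english_counts, chinese_counts)
```

Because a circular time shift changes only the phase of the DFT coefficients, the magnitude spectra of the two example series are identical. A correlation computed day by day would penalize the one-day reporting lag; under the assumptions above, the spectral comparison does not.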

    Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models

    This study investigates machine translation between related languages, i.e., languages within the same family that share linguistic characteristics such as word order and lexical similarity. Machine translation through few-shot prompting leverages a small set of translation pair examples to generate translations for test sentences. This procedure requires the model to learn how to generate translations while simultaneously ensuring that token ordering is maintained to produce a fluent and accurate translation. We propose that for related languages, the task of machine translation can be simplified by leveraging the monotonic alignment characteristic of such languages. We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations. Through automatic and human evaluation conducted on multiple related language pairs across various language families, we demonstrate that our proposed approach of decomposed prompting surpasses multiple established few-shot baseline approaches. For example, DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages. Comment: EMNLP 2023 (Main, Long paper)

    SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

    We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Most models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained "balanced multilingual" capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios. Comment: 15 pages, 7 figures

    Measurement of the inclusive and dijet cross-sections of b-jets in pp collisions at sqrt(s) = 7 TeV with the ATLAS detector

    The inclusive and dijet production cross-sections have been measured for jets containing b-hadrons (b-jets) in proton-proton collisions at a centre-of-mass energy of sqrt(s) = 7 TeV, using the ATLAS detector at the LHC. The measurements use data corresponding to an integrated luminosity of 34 pb^-1. The b-jets are identified using either a lifetime-based method, where secondary decay vertices of b-hadrons in jets are reconstructed using information from the tracking detectors, or a muon-based method where the presence of a muon is used to identify semileptonic decays of b-hadrons inside jets. The inclusive b-jet cross-section is measured as a function of transverse momentum in the range 20 < pT < 400 GeV and rapidity in the range |y| < 2.1. The bbbar-dijet cross-section is measured as a function of the dijet invariant mass in the range 110 < m_jj < 760 GeV, the azimuthal angle difference between the two jets and the angular variable chi in two dijet mass regions. The results are compared with next-to-leading-order QCD predictions. Good agreement is observed between the measured cross-sections and the predictions obtained using POWHEG + Pythia. MC@NLO + Herwig shows good agreement with the measured bbbar-dijet cross-section. However, it does not reproduce the measured inclusive cross-section well, particularly for central b-jets with large transverse momenta. Comment: 10 pages plus author list (21 pages total), 8 figures, 1 table, final version published in European Physical Journal

    Systematic Review of Potential Health Risks Posed by Pharmaceutical, Occupational and Consumer Exposures to Metallic and Nanoscale Aluminum, Aluminum Oxides, Aluminum Hydroxide and Its Soluble Salts

    Aluminum (Al) is a ubiquitous substance encountered both naturally (as the third most abundant element) and intentionally (used in water, foods, pharmaceuticals, and vaccines); it is also present in ambient and occupational airborne particulates. Existing data underscore the importance of Al physical and chemical forms in relation to its uptake, accumulation, and systemic bioavailability. The present review represents a systematic examination of the peer-reviewed literature on the adverse health effects of Al materials published since a previous critical evaluation compiled by Krewski et al. (2007). Challenges encountered in carrying out the present review reflected the experimental use of different physical and chemical Al forms, different routes of administration, and different target organs in relation to the magnitude, frequency, and duration of exposure. Wide variations in diet can result in Al intakes that are often higher than the World Health Organization provisional tolerable weekly intake (PTWI), which is based on studies with Al citrate. Comparing daily dietary Al exposures on the basis of “total Al” assumes that gastrointestinal bioavailability for all dietary Al forms is equivalent to that for Al citrate, an approach that requires validation. Current occupational exposure limits (OELs) for identical Al substances vary as much as 15-fold. The toxicity of different Al forms depends in large measure on their physical behavior and relative solubility in water. The toxicity of soluble Al forms depends upon the delivered dose of Al3+ to target tissues. Trivalent Al reacts with water to produce bidentate superoxide coordination spheres, [Al(O2)(H2O)4]2+ and [Al(H2O)6]3+, that after complexation with O2•−, generate Al superoxides, [Al(O2•)(H2O)5]2+. Semireduced AlO2• radicals deplete mitochondrial Fe and promote generation of H2O2, O2•− and OH•. Thus, it is the Al3+-induced formation of oxygen radicals that accounts for the oxidative damage that leads to intrinsic apoptosis. In contrast, the toxicity of the insoluble Al oxides depends primarily on their behavior as particulates. Aluminum has been held responsible for human morbidity and mortality, but there is no consistent and convincing evidence to associate the Al found in food and drinking water at the doses and chemical forms presently consumed by people living in North America and Western Europe with increased risk for Alzheimer's disease (AD). Neither is there clear evidence to show use of Al-containing underarm antiperspirants or cosmetics increases the risk of AD or breast cancer. Metallic Al, its oxides, and common Al salts have not been shown to be either genotoxic or carcinogenic. Aluminum exposures during neonatal and pediatric parenteral nutrition (PN) can impair bone mineralization and delay neurological development. Adverse effects to vaccines with Al adjuvants have occurred; however, recent controlled trials found that the immunologic response to certain vaccines with Al adjuvants was no greater, and in some cases less than, that after identical vaccination without Al adjuvants. The scientific literature on the adverse health effects of Al is extensive. Health risk assessments for Al must take into account individual co-factors (e.g., age, renal function, diet, gastric pH). Conclusions from the current review point to the need for refinement of the PTWI, reduction of Al contamination in PN solutions, justification for routine addition of Al to vaccines, and harmonization of OELs for Al substances.

    Measurement of charged-particle event shape variables in inclusive sqrt(s) = 7 TeV proton-proton interactions with the ATLAS detector

    The measurement of charged-particle event shape variables is presented in inclusive inelastic pp collisions at a center-of-mass energy of 7 TeV using the ATLAS detector at the LHC. The observables studied are the transverse thrust, thrust minor, and transverse sphericity, each defined using the final-state charged particles' momentum components perpendicular to the beam direction. Events with at least six charged particles are selected by a minimum-bias trigger. In addition to the differential distributions, the evolution of each event shape variable as a function of the leading charged-particle transverse momentum, charged-particle multiplicity, and summed transverse momentum is presented. Predictions from several Monte Carlo models show significant deviations from data.
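To make one of these observables concrete, the sketch below computes transverse sphericity from a list of transverse momentum components. The abstract gives no formula, so this assumes the commonly used definition S_T = 2·λ2 / (λ1 + λ2), where λ1 ≥ λ2 are the eigenvalues of the 2×2 tensor built from sums of px², px·py and py² over the charged particles. The function name and the input format (a list of `(px, py)` pairs) are hypothetical choices for this example.

```python
import math


def transverse_sphericity(particles):
    """Transverse sphericity from (px, py) pairs, assuming S_T = 2*l2 / (l1 + l2).

    l1 >= l2 are the eigenvalues of the symmetric tensor
        [[sum px*px, sum px*py],
         [sum px*py, sum py*py]].
    Any overall normalisation of the tensor cancels in the ratio.
    """
    sxx = sum(px * px for px, _ in particles)
    sxy = sum(px * py for px, py in particles)
    syy = sum(py * py for _, py in particles)

    trace = sxx + syy  # equals l1 + l2
    if trace == 0:
        raise ValueError("no transverse momentum in the event")

    det = sxx * syy - sxy * sxy  # equals l1 * l2
    # Closed-form eigenvalues of a symmetric 2x2 matrix; clamp tiny negatives
    # caused by floating-point rounding.
    half_gap = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    smaller_eigenvalue = trace / 2.0 - half_gap
    return 2.0 * smaller_eigenvalue / trace


# Two limiting cases, with momenta in arbitrary units:
pencil_like = transverse_sphericity([(1.0, 0.0), (-1.0, 0.0)])
isotropic = transverse_sphericity([(1, 0), (-1, 0), (0, 1), (0, -1)])
```

Under this definition a back-to-back, pencil-like event gives a value near 0 and an event that is isotropic in the transverse plane gives a value near 1. Transverse thrust and thrust minor are not shown because they require maximising over a thrust axis.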

    Pan-cancer analysis of whole genomes

    Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale(1-3). Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4-5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter(4); identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation(5,6); analyses timings and patterns of tumour evolution(7); describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity(8,9); and evaluates a range of more-specialized features of cancer genomes(8,10-18). Peer reviewed